## [1] "/Volumes/LACIE SETUP/Data_Science/Udacity_DataAnalyst_NanoDegree_Projects/Exploratory_Data_Analysis/Exp_Summ_Data"
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
I chose the Prosper Loan data set and am a bit out of my element here, as I’m not quite sure what questions to ask. And for the record, that’s a first. This loan data set has 113,937 observations and 81 variables to explore. I don’t know where to start, but I think that’s OK. I’ll let the process of exploratory data analysis lead my investigation of a chosen set of variables and their relationships with one another.
After taking a sneak peek at the data set, I noticed quite a few factor variables. Factor variables are categorical variables: R stores each value as an integer code that points to one of a fixed set of labels, called levels. I’ll explore these types of variables first to see if my exploration uncovers any patterns or anomalies that might lead my journey down an unexpected path. From there, anything interesting that I find or stumble upon should warrant further exploration, given the time.
That being said, I think it would be interesting to analyze the factor variables and see if there is any relationship with what we as consumers care about most, the interest rate on our loan, or in this case the variable ‘BorrowerRate’.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
## [1] 0.0000 0.4975
This particular plot analyzes the distribution of borrower rates within the data set. Based on the summary() and range() output for the variable ‘BorrowerRate’, the rates run from a lower limit of 0 to an upper limit of 0.4975. These values are stored as proportions, so the range is 0% to 49.75%.
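Because the rates are stored as proportions, reading them as percentages takes a multiplication by 100. Here is a minimal sketch of that step, assuming a small hypothetical vector `borrower_rate` that stands in for the ‘BorrowerRate’ column (its values are borrowed from the str() and summary() output above):

```r
# Hypothetical stand-in for the 'BorrowerRate' column (proportions, not percents)
borrower_rate <- c(0.0000, 0.1580, 0.0920, 0.2750, 0.0974, 0.2085, 0.4975)

summary(borrower_rate)               # quartiles plus the mean
rate_range <- range(borrower_rate)   # c(min, max)

# Convert the proportions to percentages for reporting
rate_range_pct <- rate_range * 100
rate_range_pct
```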
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
## # A tibble: 12 x 8
## LoanStatus borrower_rate_m… borrower_rate_m… borrower_rate_m…
## <fct> <dbl> <dbl> <dbl>
## 1 Cancelled 0.184 0.2 0.108
## 2 Chargedoff 0.235 0.24 0.01
## 3 Completed 0.186 0.174 0
## 4 Current 0.184 0.176 0.0577
## 5 Defaulted 0.223 0.230 0
## 6 FinalPaym… 0.197 0.190 0.0629
## 7 Past Due … 0.253 0.255 0.145
## 8 Past Due … 0.231 0.232 0.0749
## 9 Past Due … 0.235 0.242 0.0599
## 10 Past Due … 0.233 0.247 0.0649
## 11 Past Due … 0.240 0.247 0.0659
## 12 Past Due … 0.238 0.250 0.0766
## # ... with 4 more variables: borrower_rate_max <dbl>,
## # monthly_income_mean <dbl>, loan_amount_mean <dbl>, n <int>
This particular plot aggregates the counts for each level of ‘LoanStatus’. Because ‘LoanStatus’ is a factor variable, we can clearly see how the counts vary between levels. Based on the table() function, which aggregates our data, we can see that the level ‘Current’ has the largest count with 56,576 observations, while ‘Cancelled’ has only 5 observations, and lastly, 205 lucky souls have their ‘FinalPaymentInProgress’. Let’s take a look at the loan stats by loan status data frame I created and see how that plots out. It would be interesting to see how the means vary between loan status levels for some of the exploratory variables I’ve created in the ‘lnStats_by_LoanSts’ data frame.
The plots above are related to the ‘lnStats_by_LoanSts’ data frame created to further explore the Loan Status variable. In the first of three, we have the borrower rate mean grouped by loan status. The plot itself, however, just shows the counts of borrower rate means. The second plot is the average monthly income per category of ‘LoanStatus’, but it too only shows the counts of monthly averaged incomes. The final plot goes back to our plot of the borrower rate mean for each level in ‘LoanStatus’, except here we are coloring by the (categorical) parent factor variable, ‘LoanStatus’. It would be interesting to see which level in ‘LoanStatus’ has the lowest median borrower rate.
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
## # A tibble: 9 x 5
## EmploymentStatus borrower_rate_mean mthly_inc_mean list_cat_median n
## <fct> <dbl> <dbl> <dbl> <int>
## 1 "" 0.186 5165. 0 2255
## 2 Employed 0.193 6139. 1 67322
## 3 Full-time 0.187 5043. 1 26355
## 4 Not available 0.191 4555. 0 5347
## 5 Not employed 0.244 197. 3 835
## 6 Other 0.214 3568. 1 3806
## 7 Part-time 0.184 1640. 1 1088
## 8 Retired 0.194 2987. 2 795
## 9 Self-employed 0.202 6338. 1 6134
This plot conveys the employment status of the borrower at the time they posted the listing, or application for a loan. We can see here that the level ‘Employed’ has the largest count, with 67,322 observations, and ‘Retired’ the smallest, with 795 observations, just below ‘Not employed’ with 835. It would be interesting to further explore if any of the ‘Not employed’ applicants were approved for their loan. Or better yet, check out the variables I’ve created with a couple of plots. Maybe, if I have time…
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
This is an interesting plot. If you’re not careful, you can be fooled by what you see, which isn’t incorrect. However, it is misleading. Here we have a plot that attempts to convey the counts of levels within the variable ‘IncomeRange’. It even looks normally distributed, but it’s not. The levels in income range are not in ascending order (the $100,000+ income range is the 3rd level in the sequence, but it should be the last). Let’s try plotting the ordered factor variable, income range, and see how our plots differ.
## [1] "$0" "$1-24,999" "$100,000+" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed" "Not employed"
That’s better. Now we can see the true distribution of loanees across the income ranges. I wonder what the level of correlation is between income range and borrower rate? That may be something I’ll have to explore later on.
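The reordering described above can be sketched as follows. The level names are copied from the levels() output. The sample values and the chosen order (non-numeric levels first, then ascending income) are assumptions for illustration, not necessarily the order used for the plot.

```r
# Levels exactly as printed by levels() above (alphabetical order)
income_levels_alpha <- c("$0", "$1-24,999", "$100,000+", "$25,000-49,999",
                         "$50,000-74,999", "$75,000-99,999",
                         "Not displayed", "Not employed")

# Hypothetical sample of values standing in for the 'IncomeRange' column
income_range <- factor(c("$25,000-49,999", "$100,000+", "$1-24,999",
                         "Not employed", "$50,000-74,999"),
                       levels = income_levels_alpha)

# Assumed ordering: non-numeric levels first, then ascending income
income_levels_ordered <- c("Not displayed", "Not employed", "$0", "$1-24,999",
                           "$25,000-49,999", "$50,000-74,999",
                           "$75,000-99,999", "$100,000+")

# Rebuild the factor with the new level order and mark it as ordered
income_range_ord <- factor(income_range, levels = income_levels_ordered,
                           ordered = TRUE)
levels(income_range_ord)
```

Any plot that uses the factor's level order will then draw the bars in this sequence instead of alphabetically.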
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
The Prosper score is a custom risk score built using historical Prosper data. The score is described as ranging from 1 to 10, with 10 being the best, or lowest-risk, score, although the table above also shows 1,456 observations with a score of 11. Here we have a count of all the observations within the data set by their ‘ProsperScore’. The distribution looks roughly normal, with a majority of observations scoring between 4 and 8. The subsequent plot zooms in on Prosper scores between 3 and 9. By doing so, we’re able to clearly see how the counts differ between these Prosper scores.
## False True
## 56459 57478
The plot here shows the distribution of applicants who were classified as homeowners vs. applicants who were not. Home ownership within this data set is defined as having a mortgage on the credit profile at the time of submitting the loan application, or providing documentation confirming homeownership. It seems pretty clear the data set is split almost evenly. Based on the summary function run on the variable ‘IsBorrowerHomeowner’, we see a non-homeowner applicant count of 56,459 and a homeowner applicant count of 57,478.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.6 739.0 899.0 591
##
## (0,50] (50,100] (100,150] (150,200] (200,250] (250,300] (300,350]
## 0 0 0 0 0 0 0
## (350,400] (400,450] (450,500] (500,550] (550,600] (600,650] (650,700]
## 1 41 1041 3067 6084 16371 48329
## (700,750] (750,800] (800,850] (850,900] (900,950]
## 22190 13874 1976 239 0
This was a cool exploratory variable to analyze. After looking at the data set I recognized that all the credit scores had a range of 19, meaning for all borrowers in the data set, their ‘CreditScoreRangeLower’ limit was just 19 points from their ‘CreditScoreRangeUpper’ limit. Realizing this, I thought it would be better to assess the data if I created a credit range bucket exploratory variable I could use to filter on. After doing so, I generated the above three plots.
Plot one is just a histogram distribution plot of the ‘CreditScoreRangeLower’ limit of the data set. Plot two looks almost exactly like it, and it should. This is a plot of the ‘CreditScoreRangeUpper’ limit, which is only 19 points from the lower limit, so in essence one should expect to see a similar distribution. The third plot however, is an exploration of the ‘CreditRange.bucket’ exploratory variable I created. Here, I’ve used a histogram to convey the distribution of true credit range counts within the data set.
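A bucket variable like ‘CreditRange.bucket’ can be sketched with cut(). This is an illustration under assumptions: the 50-point breaks from 0 to 950 are inferred from the interval labels in the table above, and using ‘CreditScoreRangeLower’ as the input is a guess, since the original code is not shown.

```r
# Hypothetical stand-in for 'CreditScoreRangeLower' (values from the str() output)
credit_lower <- c(640, 680, 480, 800, 680, 740, 680, 700, 820, NA)

# Assumption: 50-point bins from 0 to 950, matching the labels in the table
credit_bucket <- cut(credit_lower, breaks = seq(0, 950, by = 50))

# Counts per bucket; NA values are left out of the table by default
table(credit_bucket)
```

Because cut() uses right-closed intervals by default, a score of exactly 700 lands in (650,700] rather than (700,750].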
## [1] "" "A" "AA" "B" "C" "D" "E" "HR" "NC"
## A AA B C D E HR NC
## 84984 3315 3509 4389 5649 5153 3289 3508 141
The first ‘CreditGrade’ plot is just an aggregation of the factor variable within the data set. Credit Grade has 9 levels: ‘’ (blank), ‘A’, ‘AA’, ‘B’, ‘C’, ‘D’, ‘E’, ‘HR’, and ‘NC’. Taking a look at the summary output of this variable, we see that the blank level has the highest count at 84,984. I’m curious to see what the blank classification really stands for. Is it missing data or, perhaps, an actual grade of credit? I’ll have to dig a bit further to find out. I believe ‘NC’ stands for no credit, and this grade has a count of 141 within our data set. I’ll have to confirm that ‘NC’ means no credit, however, just to be thorough.
After reviewing the text file that comes with the data set, I was able to verify that ‘CreditGrade’ ratings are only applicable to listings created prior to 2009. Since ‘CreditGrade’ is only applicable to listings pre-2009, we are going to subset our data using the supporting variable ‘ListingCreationDate’ to further explore the count distribution within ‘CreditGrade’. This manipulation is visualized in the second plot.
## AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
This is a great plot of the ‘BorrowerState’ variable. It does a great job aggregating the total counts of loans per state. I’m hoping with this plot we can see if there are any patterns or anomalies in the distribution of loan counts by state within the data set. California has the highest count of borrowers with 14,717, and North Dakota the fewest with 52.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.774 3.000 20.000
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 16965 58308 7433 7189 2395 756 2572 10494 199 85 91 217
## 12 13 14 15 16 17 18 19 20
## 59 1996 876 1522 304 52 885 768 771
Based on a summary of this exploratory variable, we can see that the highest count for ‘ListingCategory..numeric.’ belongs to category 1, ‘Debt Consolidation’, with 58,308 observations. The listing category with the fewest observations is 17, with 52, followed closely by 12, ‘Green Loans’, with 59.
This is an interesting plot which I hope to explore further. The variable ‘ListingCategory..numeric.’ is currently stored as a number, with each number standing for a category. The text file provided with the data set maps each number to its named category.
In my further exploration of this variable, I’d like to convert it into a factor variable with a bit more meaning behind it. Instead of numbers as levels to identify the category, I’ll reassign the variable to show its true category. This will make plots of this variable much easier to read.
I’d also like to see if any patterns exist within this variable. Are loan approvals more associated with one listing category than another? We can dig into that later.
The original data set has the variable ‘ListingCategory..numeric.’ as an integer, which means that when plotting, you can’t filter on it. This was unfortunate because it makes a great factor variable: it is categorical in nature, just conveyed within the data set as an integer. I chose to change this. Using the cut() function, I created a factor variable, ‘ReasonForLoan’. By doing so, I can now show and filter by ‘ReasonForLoan’ to get a discernible count of which loan reason is the most prevalent in the data set. Here we can see that ‘debt consolidation’ is the most stated, pun intended, ‘Reason for a Loan’. I have to say though, I’ve always found it odd and a bit hypocritical that social convention’s preferred method of getting out of debt is taking out another loan. Ehhh….
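The integer-to-factor conversion can be sketched as below. The report used cut(); this sketch swaps in factor() with explicit levels and labels, which is a common alternative. Only the two category names mentioned in the text are filled in. The remaining labels are placeholders that would be replaced from the data dictionary.

```r
# Hypothetical stand-in for 'ListingCategory..numeric.' (values from the str() output)
listing_category <- c(0, 2, 0, 16, 2, 1, 1, 2, 7, 7)

# Placeholder labels "Category 0" ... "Category 20"; only the two names stated
# in the text are real, the rest are stand-ins for the data dictionary entries
category_labels <- paste("Category", 0:20)
category_labels[1 + 1]  <- "Debt Consolidation"   # code 1
category_labels[12 + 1] <- "Green Loans"          # code 12

# Map each integer code to its label
reason_for_loan <- factor(listing_category, levels = 0:20,
                          labels = category_labels)
table(reason_for_loan)[["Debt Consolidation"]]
```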
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
Here we can see the distribution of ‘BorrowerAPR’ rates using the scatter plot technique. Based on the summary statistics generated using the summary() function, we can see a minimum ‘BorrowerAPR’ of 0.00653 (0.653%) and a max of 0.51229 (51.229%). The mean ‘BorrowerAPR’ for the data set is 0.21883 (21.883%). An interesting thought: I wonder how the overall ‘BorrowerAPR’ mean would compare with the average ‘BorrowerAPR’ grouped by ‘CreditRange.bucket’, the exploratory variable I created earlier? That may need to be left for future explorations.
This plot conveys the distribution of the ‘EstimatedEffectiveYield’ exploratory variable using the scatter plot method. The first plot depicts a bit of overplotting near the x-axis. To further investigate this, I’ve applied a log10 transformation to the y-axis using scale_y_log10(). By doing so, we can see in the second plot a much clearer distribution of plot points under 1,000, and also a better view of the overplotting near the x-axis.
Here we have analyzed the exploratory variable ‘MonthlyLoanPayment’. The distribution of values within the data set is conveyed using the scatter plot method in figure 1. We can see quite a bit of overplotting near the x-axis, and more as the count of borrowers with the same monthly loan payment increases.
I was curious to explore the proportion of monthly loan payments to the original amount of the borrower’s loan so I created an exploratory variable to do so. The variable ‘prop_MthlyPymt’ was developed by dividing the borrower’s monthly loan payment by their original loan amount. In figure 2, I used a histogram to show the distribution of proportion monthly payments, ‘prop_MthlyPymt’ to convey the count of borrowers with similar proportion percentages.
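The ratio behind ‘prop_MthlyPymt’ is a single column division. Here is a small sketch, assuming a hypothetical data frame `loans` filled with the first few (rounded) values shown in the str() output:

```r
# Hypothetical rows; the amounts are rounded values from the str() output
loans <- data.frame(
  MonthlyLoanPayment = c(330, 319, 123, 321, 564),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000)
)

# Monthly payment as a proportion of the original loan amount
loans$prop_MthlyPymt <- loans$MonthlyLoanPayment / loans$LoanOriginalAmount
round(loans$prop_MthlyPymt, 4)
```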
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
By plotting the variable ‘LoanOriginalAmount’, we are able to see at a high level the distribution of loan amounts throughout the data set. In figure 1’s plot we can see a lot of overplotting near the x-axis, where a majority of the loan amounts carry similar values. I’m hoping a further exploration of the y-axis parameters will show more of the variation among the loan amounts. For now, running the summary function on the variable ‘LoanOriginalAmount’ produces a minimum loan amount of $1,000, a mean of $8,337, a median of $6,500, and a max loan amount of $35,000.
After further analysis, we can see in figure 2 that taking the square root of ‘LoanOriginalAmount’ shrinks the scale on the x-axis from a visible limit of 30,000 to a little more than 150. To give some meaning to this, 150 squared is 22,500, and the maximum loan amount of $35,000 has a square root of about 187. That being said, we can see the values are much more spread out. A further exploration in figure 3 shows more dispersion among the values of ‘LoanOriginalAmount’.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 16.00 22.93 33.00 141.00 91852
The variable ‘TotalProsperPaymentsBilled’ refers to the number of on-time payments the borrower made on Prosper loans at the time they created the loan application. This value will be null if the borrower had no prior loans. By running our summary function, I was able to determine that 91,852 applicants had no prior loans with Prosper and were therefore categorized as NA’s within this exploratory variable. This means that 80.6% of the loanees in this data set did not have an existing loan with Prosper before filling out an application.
It would be interesting to find out the distribution of ‘BorrowerRate’ for applicants in this bucket, who’ve had no prior loan experience with Prosper. And does prior loan experience with Prosper improve an applicant’s chances of getting a better borrower rate?
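The 80.6% figure follows directly from the counts in the output above. A short check, with a toy vector `x` (hypothetical) showing how the same share would be computed on the column itself:

```r
# Counts taken from the str() and summary() output above
n_total <- 113937   # observations in the data set
n_na    <- 91852    # NA's in 'TotalProsperPaymentsBilled'

share_na <- n_na / n_total
round(100 * share_na, 1)   # percentage of listings with no prior Prosper loan

# On a real column the same share is mean(is.na(x)); toy illustration:
x <- c(NA, NA, NA, NA, 11)
mean(is.na(x))
```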
Taking a closer look at the exploratory variable ‘TotalProsperPaymentsBilled’ in figure 2, I’ve taken the square root of ‘TotalProsperPaymentsBilled’, and used a line plot to show its distribution. For figure 3, I’ve taken the log of ‘TotalProsperPaymentsBilled’, using log10() to further analyze and stretch the visual.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 0.61 0.00 42.00 91852
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 18285 1810 660 329 205 162 95 95 48 60 45 36
## 12 13 14 15 16 17 18 19 20 21 22 23
## 25 26 22 28 20 25 9 10 9 11 7 6
## 24 25 26 27 28 29 30 31 32 33 34 35
## 11 9 11 2 5 2 2 1 2 1 3 1
## 36 39 40 41 42
## 1 1 1 1 3
Here we can see a plot of the aggregated count of loanees with payments less than one month late. As you can see, this is a right-skewed distribution, with the large majority of values at 0. What’s really interesting here is that the variable ‘ProsperPaymentsLessThanOneMonthLate’ has a range of 0 to 42. I’d be curious to see what proportion of loanees have values beyond the conventional length of a calendar month (31), keeping in mind that the variable name suggests a count of late payments rather than a number of days. That being said, it looks like Prosper has quite a few responsible customers paying on time, as the count at 0 is 18,285.
I tried zooming in on Prosper payments beyond the conventional length of a month, 31 days, to get an idea of how many borrowers, or what ratio of borrowers, might occupy this bucket. In figure 2, we can see that 14 borrowers, a proportion of 0.000634 (about 0.0634%) of the 22,085 borrowers with a recorded value, have payments less than one month late but beyond the 31 threshold.
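The count and share beyond 31 can be reproduced from the frequency table printed above. The sketch below rebuilds that table as two vectors (the names `late_values` and `late_counts` are made up for illustration) and takes the denominator to be the non-missing observations:

```r
# Values and counts copied from the table() output above
late_values <- c(0:36, 39:42)
late_counts <- c(18285, 1810, 660, 329, 205, 162, 95, 95, 48, 60, 45, 36,
                 25, 26, 22, 28, 20, 25, 9, 10, 9, 11, 7, 6,
                 11, 9, 11, 2, 5, 2, 2, 1, 2, 1, 3, 1,
                 1, 1, 1, 1, 3)

n_non_missing <- sum(late_counts)                      # borrowers with a value
n_above_31    <- sum(late_counts[late_values > 31])    # values beyond 31
pct_above_31  <- 100 * n_above_31 / n_non_missing      # share as a percentage

c(n_non_missing = n_non_missing, n_above_31 = n_above_31)
round(pct_above_31, 4)
```

The non-missing total also matches 113,937 observations minus the 91,852 NA's reported by summary().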
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 0.05 0.00 21.00 91852
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 21700 185 69 49 20 16 9 16 5 3 1 3
## 12 16 18 19 21
## 1 1 1 3 3
Here, the plot of the variable ‘ProsperPaymentsOneMonthPlusLate’ shows an aggregated count of loanees whose payments are more than a month late. It would be interesting to find out if any loanees whose payments were categorized as less than a month late, but fell beyond the 31-day calendar month threshold, are also in this plotted variable.
There are 113,937 observations in this data set with 81 variables. I chose to take a look at a few factor variables to get a better understanding of the data: CreditGrade, LoanStatus, BorrowerState, EmploymentStatus, IsBorrowerHomeowner, and IncomeRange. I also analyzed a few other variables to round out my high-level exploration of the data set. Here’s what I found:
A summary of the entire data set:
The main features of interest for me in this data set are the borrower rate, Prosper score, Prosper payments less than one month late and Prosper payments one month plus late. This may change as my exploration of the data deepens with every new pattern or anomaly I stumble upon. It should be an interesting journey. I’m looking forward to it.
In support of my investigation of the variable borrower rate, I’ll also explore the following supportive (I hope :|) exploratory variables:
I hope in my analysis of these features I can uncover some insightful and useful ideas to help improve my own situation, outside of the obvious known influencers of borrower rate.
In order to thoroughly assess this data set, and appease my curiosity, I had to conduct a few manipulations in order to answer a few initial questions I had about the data. The following variables were created during my exploration:
Parent Feature: Loan Status
Exploratory Variable Created:
* ‘borrower_rate_mean’
* ‘borrower_rate_median’
* ‘borrower_rate_min’
* ‘borrower_rate_max’
* ‘monthly_income_mean’
* ‘loan_amount_mean’
* ‘n() = count’
Here I created 7 variables within a data frame by grouping the data by Loan Status using the group_by() function. This generated some basic statistics about the data set grouped by the Loan Status levels. My exploration of some of these exploratory variables can be found above in the Univariate Plot section of this report.
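The grouped summary can be illustrated with a small sketch. The report used group_by(); to keep this example free of package dependencies, it swaps in base R's split() and lapply(), which give the same kind of one-row-per-group table. The data frame `loans` and its values are hypothetical.

```r
# Hypothetical rows standing in for the loan data
loans <- data.frame(
  LoanStatus   = c("Completed", "Current", "Completed", "Current", "Defaulted"),
  BorrowerRate = c(0.158, 0.092, 0.275, 0.0974, 0.2085)
)

# Split the rows by status, summarise each piece, then stack the results
stats_by_status <- do.call(rbind, lapply(split(loans, loans$LoanStatus), function(g) {
  data.frame(
    LoanStatus           = g$LoanStatus[1],
    borrower_rate_mean   = mean(g$BorrowerRate),
    borrower_rate_median = median(g$BorrowerRate),
    borrower_rate_min    = min(g$BorrowerRate),
    borrower_rate_max    = max(g$BorrowerRate),
    n                    = nrow(g)   # group size, like n() in a grouped summary
  )
}))
stats_by_status
```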
Parent Feature: Employment Status
Exploratory Variable Created:
* ‘borrower_rate_mean’
* ‘mthly_inc_mean’
* ‘list_cat_median’
* ‘n() = count’
I created 4 variables within a data frame by grouping the data set by employment status using the group_by() function. I didn’t get a chance to explore these features very much so maybe at a future date.
Parent Feature: Credit Score Range Lower & Upper
Exploratory Variable Created:
* ‘CreditRange.bucket’
In this analysis, I created one exploratory factor variable within the data set itself, which will make my journey even more fun because now I can filter my data by a true credit range. My visualization of this exploratory variable only produced counts of borrowers within a particular range. Hopefully when I get to the multivariate plot section, I can see what filtering by this bucket range really looks like.
Parent Feature: Monthly Loan Payment
Exploratory Variable Created:
* ‘prop_MthlyPymt’
Creating the proportion of monthly loan payment to original loan amount (‘MonthlyLoanPayment’ / ‘LoanOriginalAmount’) helped me understand what percentage of the amount borrowed each monthly payment represents. If I have the time, it would be interesting to see the borrowers with the top 90% of ‘prop_MthlyPymt’ percentages and compare their borrower rates, APRs, and any other interesting factors. Generating a scatter plot matrix of these variables might help me understand their relationships with one another and whether any correlation exists.
Parent Feature: Credit Grade
Exploratory Variable Created:
* ‘ListingCreationDate2’
Since ‘CreditGrade’ is only applicable to listings pre-2009, I needed to subset the data using a supportive exploratory variable, ‘ListingCreationDate’. Currently in the original data set, ‘ListingCreationDate’ is a factor variable so I’ll need to convert it to a date before subsetting my data. After doing this, the visualization shows a much clearer depiction in variance of the credit grades.
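The factor-to-date conversion and the subset can be sketched like this. It is an illustration under assumptions: the data frame `listings` is hypothetical, only its first timestamp comes from the str() output, and the cutoff of 2009-01-01 is one reading of "pre-2009".

```r
# Hypothetical stand-in for the factor column 'ListingCreationDate'
listings <- data.frame(
  ListingCreationDate = factor(c("2005-11-09 20:44:28.847000000",
                                 "2008-06-15 10:00:00.000000000",
                                 "2013-02-01 08:30:00.000000000")),
  CreditGrade = c("C", "A", "")
)

# Go through as.character() first so the labels, not the integer codes, are
# parsed; as.Date() reads the leading YYYY-MM-DD and ignores the time part
listings$ListingCreationDate2 <- as.Date(as.character(listings$ListingCreationDate),
                                         format = "%Y-%m-%d")

# Keep only listings created before the assumed cutoff
pre_2009 <- subset(listings, ListingCreationDate2 < as.Date("2009-01-01"))
pre_2009
```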
There weren’t really any unusual distributions within the univariate plots visualized above. There were a few with overplotting, but that could be handled with setting your ‘alpha’, and ‘jitter’ parameters mainly.
When trying to visualize some plots, I found it useful to manipulate the x-axis by adding layers, in particular scale_x_continuous(), scale_x_sqrt() and scale_x_log10(). On some occasions I even applied the same transformations to the exploratory variable itself, wrapped within the ggplot aesthetic ‘x =’, to further manipulate and zoom in on busy data.
In order to truly convey the entire picture with regards to visualizing the data, I had to adjust my y-axis on quite a few plots to account for the variance in count. Specifically, I had to use scale_y_sqrt(), scale_y_continuous() and lastly the function scale_y_log10().
Bivariate analysis compares two features or exploratory variables from the chosen data set and visualizes their relationship to one another. Here I’ll be looking to explore many relationships based on some patterns and anomalies I’ve noticed during my univariate exploration of the data set. In order to truly get the best out of my analysis, and avoid flying blind, I’m going to use the ggpairs() function to group sets of features together to explore their correlation with one another and other variables specified in the comparison.
ggpairs() does a great job of conveying the correlation coefficient between two variables, that is, the strength and direction (positive or negative) of their linear relationship. Let’s group some exploratory variables together and plot them to see which relationships we should explore further. Before I do that, however, considering the size of this data set, I’m going to draw a random sample and fix its seed with set.seed(). Setting the seed makes the results reproducible, and working with a sample rather than the full data set allows for quicker analysis without taking up too many computer resources. In this section, you will often see … data = prsprLoanData_smpl; this is the sample data set of 30,000 observations that was created to analyze and explore variables and their relationships.
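A seeded sample of this kind can be sketched as follows. The seed value and the name `prsprLoanData_full` are assumptions made for illustration. Only `prsprLoanData_smpl` and the sample size of 30,000 come from the text.

```r
# Hypothetical data frame standing in for the full loan data (113,937 rows)
n_rows <- 113937
prsprLoanData_full <- data.frame(id = seq_len(n_rows))   # hypothetical name

set.seed(1234)   # assumed seed value; any fixed integer makes the draw repeatable
sample_rows <- sample(n_rows, size = 30000)              # without replacement
prsprLoanData_smpl <- prsprLoanData_full[sample_rows, , drop = FALSE]

# Re-seeding with the same value reproduces exactly the same sample
set.seed(1234)
identical(sample_rows, sample(n_rows, size = 30000))
```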
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8323 12000 35000
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and LoanOriginalAmount
## t = -60.82, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3413601 -0.3212123
## sample estimates:
## cor
## -0.331324
Here we can see that the exploratory variables ‘BorrowerRate’ and ‘LoanOriginalAmount’ are only weakly negatively correlated: Pearson’s correlation coefficient between the two is -0.331.
To visualize this relationship I chose a scatter plot, but as you can see in figure 1, there is quite a bit of overplotting. In the subsequent plots, figures 2 & 3, I applied the sqrt() and log10() functions, respectively, to the ggplot ‘x’ variable, ‘BorrowerRate’. I was hoping this approach would let me see the distribution within the relationship better, but it didn’t help much. I see a bit more linear overplotting as I further transform the ‘x’ variable, but that’s about it. This is to be expected, however, as Pearson’s correlation coefficient is only -0.331, which indicates a weak relationship.
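As a rough sketch of this step, the snippet below runs Pearson's test and draws a scatter plot with a square-root x-axis. The data is a made-up stand-in with a weak negative relationship, and the use of semi-transparent points and scale_x_sqrt() (instead of calling sqrt() on the variable directly) is my own assumption about how one might handle the overplotting.

```r
library(ggplot2)

# Hypothetical stand-in data with a weak negative relationship.
set.seed(1)
n <- 5000
loans <- data.frame(BorrowerRate = runif(n, min = 0.05, max = 0.35))
loans$LoanOriginalAmount <- 15000 - 20000 * loans$BorrowerRate +
  rnorm(n, sd = 5000)

# Pearson's product-moment correlation, the same test as in the output above.
ct <- cor.test(loans$BorrowerRate, loans$LoanOriginalAmount)
ct$estimate

# Semi-transparent points reduce overplotting; scale_x_sqrt() transforms
# the x-axis without changing the underlying data.
p <- ggplot(loans, aes(x = BorrowerRate, y = LoanOriginalAmount)) +
  geom_point(alpha = 1 / 20) +
  scale_x_sqrt()
print(p)
```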
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8323 12000 35000
##
## Pearson's product-moment correlation
##
## data: EmploymentStatusDuration and TotalProsperPaymentsBilled
## t = 4.1006, df = 5666, p-value = 4.178e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02840193 0.08031705
## sample estimates:
## cor
## 0.05439625
Just a side note, but I found a scatter plot much more effective for interpreting numeric feature data, especially when trying to visualize a relationship. Here, we can see from Pearson’s correlation coefficient, 0.054, that these two variables, ‘EmploymentStatusDuration’ and ‘TotalProsperPaymentsBilled’, are barely correlated with one another.
Taking a further look at the relationship, I applied the sqrt() and log10() functions to the ‘x’ variable, ‘EmploymentStatusDuration’ to get a better view of the distribution of values.
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and (as.numeric(LoanStatus))
## t = -2.9, df = 29998, p-value = 0.003735
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.028051860 -0.005426321
## sample estimates:
## cor
## -0.01674123
## # A tibble: 12 x 8
## LoanStatus borrower_rate_m… borrower_rate_m… borrower_rate_m…
## <fct> <dbl> <dbl> <dbl>
## 1 Cancelled 0.184 0.2 0.108
## 2 Chargedoff 0.235 0.24 0.01
## 3 Completed 0.186 0.174 0
## 4 Current 0.184 0.176 0.0577
## 5 Defaulted 0.223 0.230 0
## 6 FinalPaym… 0.197 0.190 0.0629
## 7 Past Due … 0.253 0.255 0.145
## 8 Past Due … 0.231 0.232 0.0749
## 9 Past Due … 0.235 0.242 0.0599
## 10 Past Due … 0.233 0.247 0.0649
## 11 Past Due … 0.240 0.247 0.0659
## 12 Past Due … 0.238 0.250 0.0766
## # ... with 4 more variables: borrower_rate_max <dbl>,
## # monthly_income_mean <dbl>, loan_amount_mean <dbl>, n <int>
Ok, these are two interesting plots. I chose a histogram to convey the relationship between these variables because one of them is a factor variable, or categorical variable which holds no identified numeric value. So when comparing quantitative variables with qualitative ones, I realized I’d have to use the categorical variable as a filter, in order to compare them.
Because I have chosen to compare quantitative vs. qualitative variables, a basic statistical test like Pearson’s coefficient isn’t directly possible unless I use the as.numeric() function in R, which returns the integer codes of a factor’s levels. If we do this, we can compute Pearson’s coefficient between these variables, but the codes simply follow the order of the levels. For an unordered category like ‘LoanStatus’, the sign and size of the coefficient depend on that ordering, so it should be read with caution and matched carefully against the assigned category levels.
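To make the point about as.numeric() concrete, here is a small sketch with a toy factor (the values are made up for illustration). It shows that the integer codes come from the level order, so re-ordering the levels changes the codes and therefore any correlation computed from them.

```r
# A small nominal factor; levels default to alphabetical order.
status <- factor(c("Current", "Completed", "Chargedoff", "Current"))
levels(status)      # "Chargedoff" "Completed" "Current"
as.numeric(status)  # 3 2 1 3

# Re-ordering the levels changes the codes, and therefore any
# correlation computed from them.
status2 <- factor(status, levels = c("Current", "Completed", "Chargedoff"))
as.numeric(status2) # 1 2 3 1
```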
Here, in figure 1, I chose to use ‘LoanStatus’ as a filter and generate counts of ‘BorrowerRate’ by ‘LoanStatus’ category. From the visualization, we can see that the ‘Chargedoff’ and ‘Completed’ loan statuses have the largest counts.
In figure 2, I chose to convey borrower rate means (an exploratory variable created earlier in my analysis) by ‘LoanStatus’ level. This is a much clearer and easier plot to understand. Here we have a point plot displaying the mean borrower rate for each category of ‘LoanStatus’. It looks like the level ‘Past Due (>120 days)’ has the highest borrower rate mean of all the groups, at roughly 0.25 (about 25%).
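A per-group summary table like the tibble above could be built along these lines. This is a sketch on made-up rows, assuming dplyr; the column names borrower_rate_mean and borrower_rate_median are my guesses at the truncated headers in the output, while n matches a name that is visible there.

```r
library(dplyr)

# Hypothetical stand-in rows; the real data has 12 LoanStatus levels.
loans <- data.frame(
  LoanStatus   = factor(c("Current", "Current", "Completed",
                          "Chargedoff", "Chargedoff")),
  BorrowerRate = c(0.18, 0.20, 0.15, 0.24, 0.26)
)

# One row per LoanStatus level with summary statistics of BorrowerRate.
rate_by_status <- loans %>%
  group_by(LoanStatus) %>%
  summarise(
    borrower_rate_mean   = mean(BorrowerRate),
    borrower_rate_median = median(BorrowerRate),
    n                    = n()
  )

rate_by_status
```

The resulting table can be passed straight to ggplot() with geom_point() to draw the mean-by-level point plot.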
## prsprLoanData$EmploymentStatus:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1300 0.1780 0.1855 0.2375 0.4975
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0450 0.1359 0.1840 0.1928 0.2498 0.3600
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Full-time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1200 0.1724 0.1870 0.2493 0.3600
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Not available
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1400 0.1900 0.1915 0.2500 0.3000
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1830 0.2599 0.2441 0.3134 0.3500
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1565 0.2099 0.2137 0.2712 0.3500
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Part-time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1227 0.1690 0.1844 0.2399 0.3500
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Retired
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1202 0.1829 0.1944 0.2625 0.3500
## --------------------------------------------------------
## prsprLoanData$EmploymentStatus: Self-employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1400 0.1899 0.2023 0.2695 0.3500
In this exploration of ‘BorrowerRate’ vs. ‘EmploymentStatus’, I’ve run summary() through the by() function in order to summarize ‘BorrowerRate’ statistics for each ‘EmploymentStatus’ level.
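For readers who haven't used by() before, here is a minimal sketch of the call pattern on a toy data frame (the rows are made up): the first argument is the data to summarize, the second is the grouping factor, and the third is the function applied within each group.

```r
# Hypothetical stand-in rows for illustration.
loans <- data.frame(
  EmploymentStatus = factor(c("Employed", "Employed", "Retired", "Retired")),
  BorrowerRate     = c(0.10, 0.20, 0.15, 0.25)
)

# by(data, grouping factor, function): apply summary() within each level.
rate_summary <- by(loans$BorrowerRate, loans$EmploymentStatus, summary)
rate_summary
```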
To switch it up a bit, in figure 1, I chose a box plot to convey the relationship between these two exploratory variables. Again, I’ve chosen a quantitative variable, ‘BorrowerRate’, and am comparing it with a qualitative variable, ‘EmploymentStatus’, so Pearson’s correlation coefficient can only be computed if I convert the factor variable ‘EmploymentStatus’ to integer codes using as.numeric(). I hope in my future explorations of these variables I can uncover some useful insights.
As for the box plot, however, we can see a clear and distinct distribution of the minimum, maximum and median borrower rates (the median is the line inside each box) for each level of ‘EmploymentStatus’. For figure 2, I chose to take a closer look at employment status ‘Not employed’. Here, with the use of a histogram, I’ve been able to convey the distribution of borrower rate counts within the ‘Not employed’ group.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 164 1894 8564 8082 4476
## $100,000+ Not displayed Not employed
## 4547 2056 217
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and (as.numeric(IncomeRange))
## t = -27.361, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1670623 -0.1449814
## sample estimates:
## cor
## -0.1560413
I’ve been waiting for the opportunity to really leverage facet wrapping with the facet_wrap() function. And I really like it :). Using this technique, I’ve been able to visualize the distribution of borrower rates within each level of ‘IncomeRange’. This really helps when comparing the distributions among levels because you can see each level’s dispersion of the data. Just from a glance we can see that income ranges $25,000 - $49,999 and $50,000 - $74,999 have the largest counts for borrower rates between 0.05 and 0.35.
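A faceted histogram of this kind can be sketched as follows. The data is a made-up stand-in with three income ranges, and the bin width is an assumption; the point is only to show one histogram panel per factor level.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 borrowers in each of three income ranges.
set.seed(1)
loans <- data.frame(
  BorrowerRate = runif(600, min = 0.05, max = 0.35),
  IncomeRange  = factor(rep(c("$1-24,999", "$25,000-49,999", "$50,000-74,999"),
                            each = 200))
)

# facet_wrap() draws one panel per level of IncomeRange.
p <- ggplot(loans, aes(x = BorrowerRate)) +
  geom_histogram(binwidth = 0.01) +
  facet_wrap(~ IncomeRange)
print(p)
```

By default all panels share the same axes, which is what makes the counts comparable across levels.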
In figure 2, I chose a more traditional method of conveying the relationship of borrower rates amongst income range levels, using a box plot as before. At a quick glance, the ‘IncomeRange’ with the lowest typical ‘BorrowerRate’ (the line in each box marks the median) seems to be $100,000+. I wonder why that might be? It would be cool to compare credit score ranges of these borrowers using the ‘CreditRange.bucket’ exploratory variable I created earlier.
And lastly, after running a basic statistical test between these quantitative and qualitative/categorical variables, we can see there is only a weak negative correlation between them (-0.156). Instead of looking at income ranges, let’s explore another supporting variable which is quantitative in nature, and a bit more specific when describing income. That variable is ‘StatedMonthlyIncome’.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3196 4617 5658 6833 1750003
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and StatedMonthlyIncome
## t = -9.4292, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06563661 -0.04307160
## sample estimates:
## cor
## -0.05436105
As you can see, this box plot shows the variance of borrower rates based on the stated income of the loanee, which happens to fall into a certain income range. I do notice a few things that raise a couple more questions, however. The categories ‘Not displayed’ and ‘Not employed’ are on opposite ends of the spectrum. ‘Not employed’ borrowers seem to have the lowest borrower rates, at less than 0.10 (10%), which seems a bit odd. I may have to look into this further. I wonder if this may be due to a tax credit or something?
However, ‘Not displayed’ borrowers seem to have the highest borrower rates, at more than 0.3 (30%), which seems logical from the perspective of Prosper, who has to assume the risk of loaning out the money. It makes sense for them to assign a higher borrower rate if the borrower’s stated monthly income is not verifiable.
What really stands out to me though are the income ranges between $50,000 and $99,999: I would have expected their plots to include outliers within their income range, conveyed as points in the upper limits of ‘StatedMonthlyIncome’. What I also find to be peculiar is that income range $100,000+ has point values plotted in and under the range of $50,000, which seems strange. I’ll have to check on that also.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.808 3.000 20.000
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID
## 1486 50 447 235 510 3933 581 430 93 64 1756 1311 89 55 154
## IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND
## 1513 528 283 263 245 620 730 28 942 567 710 210 78 826 19
## NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX
## 200 152 832 107 313 1756 1088 267 491 768 98 286 53 452 1788
## UT VA VT WA WI WV WY
## 244 848 59 806 490 110 36
This is the first plot, I think, where I’ve compared two categorical or qualitative variables together. I’m quite pleased with the plot to be honest, because I wasn’t sure what to expect.
In figure 1, we can see ‘BorrowerState’ is set as the x-axis parameter and visualized within a histogram. By filtering on ‘ReasonForLoan’, I was able to convey reason-for-loan counts by state. Figure 2 is an interesting plot because I chose to use ‘BorrowerState’, a categorical variable, as the y-axis parameter. This approach produced a point plot indicating which states have at least one count of a particular ‘ReasonForLoan’. And yes, I recognize the plot itself shows ‘BorrowerState’ on the x-axis, but I had to flip the visualization in order for the ‘ReasonForLoan’ labels to show clearly; I wasn’t too worried about the ‘BorrowerState’ labels because a legend is provided for that.
## prsprLoanData_smpl$ReasonForLoan: Not Available
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0050 0.1300 0.1762 0.1819 0.2333 0.4975
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Debt Consolidation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1329 0.1795 0.1892 0.2432 0.3600
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Home Improvement
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1356 0.1984 0.2008 0.2671 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1349 0.1910 0.2007 0.2699 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Personal Loan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0495 0.1075 0.1595 0.1781 0.2300 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Student Use
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1220 0.1897 0.2051 0.2831 0.3600
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Auto
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1329 0.2059 0.2063 0.2759 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0481 0.1440 0.2190 0.2146 0.2899 0.3600
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Baby&Adoption
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0716 0.1344 0.1819 0.1912 0.2362 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Boat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0649 0.1396 0.1469 0.1676 0.2139 0.2804
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Cosmetic Procedure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1826 0.2494 0.2332 0.2939 0.3185
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Engagement Ring
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0666 0.1405 0.1774 0.1856 0.2291 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Green Loans
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.0941 0.1975 0.1942 0.2916 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Household Expenses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1679 0.2219 0.2224 0.2925 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Large Purchases
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1335 0.1840 0.1892 0.2407 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Medical/Dental
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0605 0.1559 0.2124 0.2135 0.2705 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Motorcylce
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0649 0.1449 0.2085 0.2114 0.2793 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: RV
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0605 0.1449 0.1774 0.1914 0.2492 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Taxes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0604 0.1551 0.2085 0.2108 0.2712 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Vacation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1466 0.2124 0.2103 0.2712 0.3258
## --------------------------------------------------------
## prsprLoanData_smpl$ReasonForLoan: Wedding Loans
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0649 0.1563 0.2081 0.2058 0.2566 0.3304
I’m starting to sense a pattern here… It seems I like to compare quantitative variables vs. qualitative ones, with an emphasis on filtering on the qualitative/categorical variable or feature. Here, I’ve chosen to compare ‘BorrowerRate’ vs. ‘ReasonForLoan’, but with a little twist. By using the ‘stat = summary’ and ‘fun.y = mean’ parameters within the geom_histogram() layer of the plot, I’ve been able to summarize borrower rate averages for each level of ‘ReasonForLoan’. The figure here depicts ‘Cosmetic Procedure’ as the ‘ReasonForLoan’ with the highest average borrower rate (about 0.233). ‘Boat’ is the ‘ReasonForLoan’ with the lowest borrower rate average (about 0.168).
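Here is a sketch of a "bar of group means" plot on made-up rows. Two things differ from the description above and are my own assumptions: I use geom_bar() rather than geom_histogram(), and I spell the parameter `fun`, which replaced `fun.y` from ggplot2 3.3.0 onwards.

```r
library(ggplot2)

# Hypothetical stand-in rows for illustration.
loans <- data.frame(
  ReasonForLoan = factor(c("Boat", "Boat", "Auto", "Auto")),
  BorrowerRate  = c(0.14, 0.18, 0.20, 0.22)
)

# stat = "summary" replaces the default count with a summary of y;
# `fun = mean` is the current spelling (ggplot2 >= 3.3.0) of the older
# `fun.y = mean` mentioned in the text.
p <- ggplot(loans, aes(x = ReasonForLoan, y = BorrowerRate)) +
  geom_bar(stat = "summary", fun = mean) +
  coord_flip()
print(p)

# The same group means, computed directly for comparison.
group_means <- tapply(loans$BorrowerRate, loans$ReasonForLoan, mean)
group_means
```

Computing the means with tapply() alongside the plot is a cheap way to check that the bar heights are what you think they are.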
## prsprLoanData_smpl$BorrowerState:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0435 0.1299 0.1775 0.1811 0.2300 0.4975
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: AK
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0655 0.1400 0.1819 0.1880 0.2437 0.3177
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: AL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0625 0.1449 0.2085 0.2099 0.2703 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: AR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1495 0.2148 0.2112 0.2656 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: AZ
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0595 0.1386 0.1874 0.1945 0.2500 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: CA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1330 0.1845 0.1936 0.2550 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: CO
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0629 0.1314 0.1795 0.1921 0.2489 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: CT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0590 0.1299 0.1819 0.1899 0.2497 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: DC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1099 0.1621 0.1710 0.2148 0.3200
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: DE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0667 0.1265 0.1722 0.1843 0.2302 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: FL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0495 0.1349 0.1795 0.1925 0.2524 0.3600
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: GA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1395 0.1900 0.1973 0.2574 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: HI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1100 0.1897 0.1818 0.2419 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: IA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0700 0.1239 0.1560 0.1695 0.1926 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: ID
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1434 0.2050 0.2068 0.2604 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: IL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1355 0.1819 0.1944 0.2573 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: IN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0595 0.1362 0.1845 0.1952 0.2552 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: KS
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0599 0.1335 0.1840 0.1949 0.2599 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: KY
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.1349 0.1815 0.1971 0.2599 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: LA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0644 0.1399 0.1845 0.1976 0.2566 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: MA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1239 0.1708 0.1821 0.2306 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: MD
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0481 0.1371 0.1819 0.1945 0.2552 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: ME
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0700 0.1094 0.1425 0.1587 0.1906 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: MI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0050 0.1404 0.1975 0.1998 0.2499 0.3600
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: MN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1362 0.1870 0.1955 0.2506 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: MO
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1449 0.2000 0.2053 0.2685 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: MS
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0725 0.1497 0.1995 0.2031 0.2579 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: MT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0724 0.1421 0.1945 0.2018 0.2660 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: NC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1355 0.1855 0.1953 0.2575 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: ND
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0900 0.1487 0.2000 0.2112 0.2850 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: NE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0643 0.1459 0.1902 0.2020 0.2699 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: NH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0655 0.1363 0.1767 0.1936 0.2489 0.3600
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: NJ
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0519 0.1314 0.1774 0.1907 0.2498 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: NM
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0716 0.1398 0.1800 0.1931 0.2498 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: NV
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0605 0.1399 0.1883 0.1971 0.2511 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: NY
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1269 0.1774 0.1894 0.2514 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: OH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0629 0.1465 0.1943 0.2014 0.2574 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: OK
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0560 0.1363 0.1774 0.1871 0.2379 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: OR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0604 0.1370 0.1979 0.2041 0.2699 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: PA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1399 0.1928 0.2010 0.2669 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: RI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0809 0.1421 0.1995 0.2028 0.2636 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: SC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0590 0.1300 0.1835 0.1930 0.2511 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: SD
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0609 0.1585 0.2049 0.2108 0.2699 0.3304
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: TN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1389 0.1842 0.1975 0.2591 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: TX
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.1299 0.1840 0.1909 0.2500 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: UT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1349 0.1900 0.1962 0.2568 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: VA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1189 0.1774 0.1857 0.2500 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: VT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0590 0.1357 0.1875 0.1959 0.2586 0.3199
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: WA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.1314 0.1845 0.1899 0.2432 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: WI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0550 0.1250 0.1700 0.1837 0.2441 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: WV
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0619 0.1187 0.1787 0.1877 0.2486 0.3500
## --------------------------------------------------------
## prsprLoanData_smpl$BorrowerState: WY
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.1263 0.1920 0.1891 0.2683 0.3177
I really like this plot. It’s probably because of the pretty cool colors it uses, but I also think it’s quite informative at a high-level glance. I’ve chosen to compare ‘BorrowerRate’ vs. ‘BorrowerState’ by visualizing their relationship using the box plot technique. Because we have multiple borrowers from each state, in order to show the average ‘BorrowerRate’ I had to add a layer to the plot using the stat_summary() function and pass it the parameter fun.y = mean. The result is this beautiful visualization, which shows Iowa (IA) with one of the lowest average borrower rates, at around 0.17 (17%), and Arkansas (AR) with one of the highest, at around 0.21 (21%). In the summary output above, Maine (ME) actually has the lowest mean in the sample, at about 0.159, and North Dakota (ND) matches Arkansas at about 0.211.
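The box plot with an extra layer for the group mean might be sketched like this. The rates below are invented for three states purely for illustration, and `fun = mean` is the newer spelling of `fun.y = mean` (ggplot2 3.3.0 and later).

```r
library(ggplot2)

# Hypothetical rates for three states (values invented for illustration).
loans <- data.frame(
  BorrowerState = factor(rep(c("AR", "IA", "ME"), each = 4)),
  BorrowerRate  = c(0.15, 0.21, 0.22, 0.27,
                    0.12, 0.15, 0.16, 0.25,
                    0.11, 0.14, 0.15, 0.24)
)

# The box shows median and quartiles; stat_summary() overlays the mean
# of each state as a cross.
p <- ggplot(loans, aes(x = BorrowerState, y = BorrowerRate,
                       fill = BorrowerState)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4)
print(p)

# Group means computed directly, to identify the lowest state.
state_means <- tapply(loans$BorrowerRate, loans$BorrowerState, mean)
names(which.min(state_means))
```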
##
## Pearson's product-moment correlation
##
## data: EmploymentStatusDuration and OnTimeProsperPayments
## t = 4.1473, df = 5666, p-value = 3.414e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02902017 0.08093178
## sample estimates:
## cor
## 0.05501315
Based on Pearson’s correlation coefficient, it seems the two quantitative variables I chose to compare, ‘EmploymentStatusDuration’ and ‘OnTimeProsperPayments’, are barely correlated, with a coefficient of just 0.055. That being said, it didn’t stop me from attempting to view their relationship via a visualization.
In this plot, we can see there is quite a bit of overplotting near the x-axis. To get a better view of the dispersion of values, I chose to add scale_x_log10() to the plot, as well as geom_smooth(), which adds a fitted linear model to our visualization, conveying the mean trend throughout the distribution.
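A plot with a log-scaled x-axis and a linear fit could be sketched as below, on made-up data. The method = "lm" argument is an assumption on my part: without it, geom_smooth() picks a non-linear smoother by default. Note also that zeros on the x-axis become -Inf under a log scale and are dropped with a warning, which is why the stand-in durations start at 1.

```r
library(ggplot2)

# Hypothetical stand-in data; durations start at 1 because log10(0) is -Inf.
set.seed(1)
loans <- data.frame(
  EmploymentStatusDuration = sample(1:400, 500, replace = TRUE)
)
loans$OnTimeProsperPayments <- rpois(500, lambda = 20)

p <- ggplot(loans, aes(x = EmploymentStatusDuration,
                       y = OnTimeProsperPayments)) +
  geom_point(alpha = 1 / 5) +
  scale_x_log10() +                            # spread out the crowded low end
  geom_smooth(method = "lm", formula = y ~ x)  # straight-line fit
print(p)
```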
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 9.0 15.0 22.1 31.0 131.0 24332
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 1.000 1.422 2.000 7.000 24332
##
## 1 2 3 4 5 6 7
## 3988 1158 384 104 20 12 2
##
## Pearson's product-moment correlation
##
## data: OnTimeProsperPayments and TotalProsperLoans
## t = 75.379, df = 5666, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6943606 0.7203674
## sample estimates:
## cor
## 0.7076035
Here’s a relationship worth exploring a little further: ‘OnTimeProsperPayments’ vs. ‘TotalProsperLoans’. Running Pearson’s correlation test, we get a coefficient of 0.708, a fairly strong positive correlation. Our ggpairs scatter plot matrix hinted at this relationship, and our further analysis confirms it.
We can see here that ‘OnTimeProsperPayments’ tends to increase with ‘TotalProsperLoans’ (the number of Prosper loans the borrower had at the time this particular loan application was created), which is what the positive coefficient indicates. What does drop off as ‘TotalProsperLoans’ increases is the number of borrowers: the plot shows far less overplotting at the higher loan counts. The table output above confirms this, listing each ‘TotalProsperLoans’ value with its borrower count:
TotalProsperLoans: 1 2 3 4 5 6 7
Borrowers: 3988 1158 384 104 20 12 2
##
## Pearson's product-moment correlation
##
## data: MonthlyLoanPayment and LoanOriginalAmount
## t = 457.44, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9337748 0.9366128
## sample estimates:
## cor
## 0.9352089
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.5 217.7 272.8 372.7 2251.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8323 12000 35000
Now this is quite interesting… After running Pearson’s correlation test, we see that the borrower’s ‘MonthlyLoanPayment’ is highly correlated with the borrower’s ‘LoanOriginalAmount’, at 0.935, which is to be expected. With a correlation this strong, either variable could serve as a predictor of the other.
At first glance, the relationship between these variables seems almost perfectly linear, and one could easily be fooled by this. Although the two are highly correlated, there is still variance to be found if the variables are explored a bit deeper. By adding scale_x_log10() to the plot, applied to the ‘MonthlyLoanPayment’ variable, we can see that the distribution is actually slightly curved, almost parabolic. Even so, once we fit the visualization with a linear model, we can see the relationship is about as close to linear as you can get.
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and EstimatedLoss
## t = 420.32, df = 22334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9407251 0.9436693
## sample estimates:
## cor
## 0.9422154
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.081 0.112 0.366 7664
With a correlation of 0.942 between ‘BorrowerRate’ and ‘EstimatedLoss’, I would expect an almost linear relationship when visualized. I’m not exactly sure how ‘BorrowerRate’ is used to calculate ‘EstimatedLoss’, or vice versa, but we see here that the two are highly correlated with one another, and either can be used with other variables to predict the other.
##
## Pearson's product-moment correlation
##
## data: EstimatedReturn and LoanOriginalAmount
## t = -44.826, df = 22334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2992881 -0.2752238
## sample estimates:
## cor
## -0.2873013
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.182 0.075 0.092 0.096 0.117 0.267 7664
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8323 12000 35000
This was an awesome bivariate exploration for me. I say this because at first visualization the data was so busy that it was hard to discern any pattern. By adding a geom_smooth() layer, I could get a sense of the mean of the data, which looks pretty steady around 0.1 (about 10%) ‘EstimatedReturn’ (1st pair of plots), but that still didn’t convey any smooth change in variance within the data. To account for this, I divided ‘LoanOriginalAmount’ by several different bin widths, rounded the result to the nearest whole number, and then multiplied by the same bin width, so that each value snaps to the nearest multiple of that width. Over 5 iterations of the original plot, adjusting the bin width each time, I’ve been able to convey the data in a much smoother, more discernible format for recognizing trends within the data set.
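The divide, round, and multiply trick described above can be sketched in base R as follows. The helper name `bin_to_nearest`, the bin width of 2,500, and the eight sample rows are all made up for illustration; they are not taken from the original analysis.

```r
# Hypothetical helper (not from the original analysis):
# snap each value to the nearest multiple of `width`.
bin_to_nearest <- function(x, width) {
  round(x / width) * width
}

# Small made-up sample standing in for the real columns.
loans <- data.frame(
  LoanOriginalAmount = c(1000, 1400, 2600, 3900, 4100, 6300, 7400, 12000),
  EstimatedReturn    = c(0.12, 0.10, 0.11, 0.09, 0.10, 0.09, 0.08, 0.07)
)

loans$amount_bin <- bin_to_nearest(loans$LoanOriginalAmount, 2500)

# Mean return per bin: this is the smoothed series a line plot would draw.
binned <- aggregate(EstimatedReturn ~ amount_bin, data = loans, FUN = mean)
print(binned)
```

A wider bin averages over more loans and gives a smoother line, at the cost of hiding detail; a narrower bin does the opposite, which is why trying several widths is worthwhile.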
##
## Pearson's product-moment correlation
##
## data: EstimatedLoss and EstimatedReturn
## t = 104.01, df = 22334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5623272 0.5799985
## sample estimates:
## cor
## 0.571229
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.081 0.112 0.366 7664
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.182 0.075 0.092 0.096 0.117 0.267 7664
Pearson’s correlation coefficient for these two variables is 0.571. In the grand scheme of things, when searching for correlated variables to use as predictors, a correlation of roughly 0.57 is just not strong enough on its own. To show the distribution more clearly, I chose to omit the top 1% of values from the plot. Here we can see a scatter plot of ‘EstimatedLoss’ vs. ‘EstimatedReturn’.
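One way to omit the top 1% of values is to compute the 99th percentile with quantile() and keep only the rows at or below it. The sketch below uses made-up data in place of the real columns, so the distributions are assumptions and only the trimming pattern matters.

```r
set.seed(1)
# Made-up data standing in for the two estimate columns (assumption).
est <- data.frame(
  EstimatedLoss   = rexp(1000, rate = 12),
  EstimatedReturn = rnorm(1000, mean = 0.096, sd = 0.03)
)

# 99th-percentile cutoffs for each axis.
x_cap <- quantile(est$EstimatedLoss, 0.99, na.rm = TRUE)
y_cap <- quantile(est$EstimatedReturn, 0.99, na.rm = TRUE)

# Keep rows at or below both cutoffs.
trimmed <- subset(est, EstimatedLoss <= x_cap & EstimatedReturn <= y_cap)

cat(nrow(est) - nrow(trimmed), "rows dropped\n")
```

In ggplot2, the same cutoffs could instead be supplied through the `limits` argument of scale_x_continuous() and scale_y_continuous(), which drops the out-of-range points from the plot without changing the data frame.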
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8323 12000 35000
##
## Pearson's product-moment correlation
##
## data: prop_MthlyPymt and LoanOriginalAmount
## t = -54.923, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3125205 -0.2919562
## sample estimates:
## cor
## -0.3022735
I thought it would be interesting to explore another one of the exploratory features I created. Here, I’m analyzing ‘Proportion of Monthly Payment’ vs. ‘LoanOriginalAmount’. As you can see, there are quite a few dips and rises within the distribution of these values. What you see now is not nearly as busy as the data was before I adjusted the default binning. And by adding the geom_smooth() function, while passing the parameters stat = ‘summary’ and fun.y = mean, we can see a smoother depiction of the relationship between the variables.
Keeping in mind my initial mission plan, and barring any unexpected discoveries, I was hoping to explore any features that may affect borrower rate. Here are a few of the observed relationships I found worthy of discussion:
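For reference, a summary layer of this kind might look like the sketch below. It runs on synthetic data, uses geom_line() rather than geom_smooth() as the drawing layer (a swap on my part), and bins the x variable to multiples of 2,500, which is an arbitrary choice. Recent ggplot2 releases (3.3.0 and later) name the argument `fun`; older releases spell it `fun.y`.

```r
library(ggplot2)

set.seed(7)
# Synthetic stand-in for the loan data (assumption).
loans <- data.frame(
  LoanOriginalAmount = sample(seq(1000, 35000, by = 500), 2000, replace = TRUE)
)
loans$prop_MthlyPymt <- 0.04 - loans$LoanOriginalAmount * 4e-7 +
  rnorm(2000, sd = 0.005)

p <- ggplot(loans,
            aes(x = 2500 * round(LoanOriginalAmount / 2500),
                y = prop_MthlyPymt)) +
  # One value per bin: the mean of y at each distinct binned x.
  geom_line(stat = "summary", fun = mean) +
  labs(x = "LoanOriginalAmount (binned to 2,500)",
       y = "Mean prop_MthlyPymt")

# The data the layer actually draws: one row per bin.
line_data <- layer_data(p)
head(line_data[, c("x", "y")])
```

Calling print(p) draws the plot; inspecting `line_data` is a quick check that the line really is one mean per bin.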
1. Borrower Rate vs. Loan Status
* I chose this plot specifically because it really opened my eyes to how I can select and filter my data for analysis. My first few explorations were quantitative in nature. Now I have the chance to filter by factor variables. Here you can see a clear distinction between ‘BorrowerRate’ counts per ‘LoanStatus’ category.
2. Borrower Rate vs. Employment Status
* Again a quantitative vs. qualitative comparison, but still interesting nonetheless. As I had assumed going in, ‘Not employed’ borrowers had the highest mean borrower rate, which, if you think about it, makes sense.
3. Borrower Rate vs. Income Range
* What’s really interesting to see in this relationship is that as a borrower’s income increases, their average borrower rate decreases. It would be cool to explore what credit scores and other features of the data set these borrowers might have in common.
4. Borrower Rate vs. Borrower State
* To appease my curiosity, I wanted to compare my state’s average borrower rate to that of other states. The distribution of mean borrower rates amongst the states was quite interesting. There didn’t appear to be much fluctuation around .1937, the mean borrower rate for the entire data set. There are a few outliers, however, among a couple of states that may warrant further investigation.
(not the main feature(s) of interest)?
To my surprise, I actually found my exploration of non-featured exploratory variables to be much more fun and engaging than the comparisons I originally set out to analyze. Here are a few that caught my eye…
1. Loan Original Amount vs. Estimated Return
* This comparison, quantitative in nature, took some adjusting before I was finally able to recognize the slow but steady decline in ‘EstimatedReturn’ as ‘LoanOriginalAmount’ increases. I wonder what factors or other features/variables might help cause this erosion in return?
2. Estimated Loss vs. Estimated Return
* This was a cool relationship to explore because Pearson’s correlation coefficient of .571 suggests a moderate positive association between the two variables; it was just a matter of conveying it. With my visualization, I set out to show how ‘EstimatedReturn’ changes as ‘EstimatedLoss’ increases. Note that the positive coefficient means the two tend to rise together overall, so any region of the plot where returns diminish as losses grow is a local pattern rather than the overall trend. To get a better look at the distribution of values, I passed the output of the quantile() function to the ‘limits =’ parameter of the scale_x_continuous() & scale_y_continuous() layers.
3. Proportion of Monthly Payments vs. Loan Original Amount
* I chose to explore one of the variables I created earlier in my analysis, ‘prop_MthlyPymt’ within a relationship pairing. This variable reflects a borrower’s monthly loan payment in proportion to their original loan amount. In other words:
(‘MonthlyLoanPayment’ / ‘LoanOriginalAmount’)
* With a correlation coefficient of -0.302, it would seem these exploratory variables aren’t that correlated. However, -0.302 does suggest some level of association, so I decided to dig deeper. This quantitative comparison also took some massaging before I could render something discernible. From the visualization, we can see drastic rises and falls, or ‘spikes’, around proportions of 0.025, 0.05, and 0.075. It would be interesting to find out what may have caused those spikes; is it something external to the data set, like the market? Or is it an accompanying feature of the data set that has a negative effect on the relationship’s particular value?
The strongest relationship that I explored was ‘BorrowerRate’ vs. ‘EstimatedLoss’; this quantitative relationship had a correlation coefficient of 0.942. One would categorize this relationship as strong and positive, and could even use either exploratory variable as a predictor for the other due to their strong correlation.
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and BorrowerAPR
## t = 1217.3, df = 29990, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9898046 0.9902536
## sample estimates:
## cor
## 0.9900316
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.01325 0.15713 0.21156 0.21980 0.28544 0.51229 8
Here we have a plot of ‘BorrowerRate’ vs. ‘BorrowerAPR’ filtered by ‘IsBorrowerHomeowner’. Pearson’s correlation coefficient of 0.990 indicates a nearly perfect linear relationship, and we can see that linear relationship between ‘BorrowerRate’ and ‘BorrowerAPR’ in the plot. The distinction between homeowners and non-homeowners was created by using the exploratory variable ‘IsBorrowerHomeowner’ as a filter on the distribution of values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3196 4617 5658 6833 1750003
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
##
## Pearson's product-moment correlation
##
## data: StatedMonthlyIncome and BorrowerRate
## t = -9.4292, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06563661 -0.04307160
## sample estimates:
## cor
## -0.05436105
I was interested in exploring the idea of whether or not income affected borrower rates, and if so how? This is what I found:
When comparing ‘StatedMonthlyIncome’ to ‘BorrowerRate’, I had to adjust the plot quite a bit due to the amount of overplotting. I scaled the ‘x’ variable, in this case ‘StatedMonthlyIncome’, to get a better visual of the data. Once I did that, it was much easier to visually understand the relationship, or lack thereof, considering the calculated Pearson coefficient is -0.0544. Adding a filter, something simple like ‘IsBorrowerHomeowner’, really provided a nice twist to the visualization. We can clearly see spikes at ‘StatedMonthlyIncome’ amounts of 25000, 30000, 35000, 40000, 45000, 50000, and then less regularly at 70000 and beyond.
Now keep in mind these values may seem a bit off for stated monthly income amounts. In my code manipulation of the ‘x’ variable, I multiplied by 950 (almost 1,000), so in essence the original values, displayed as $xxxx.xx, have one less ‘thousands’-place digit than what is plotted. To compensate for that, you can just remove a ‘0’ from the x-axis labels to get a truer idea of stated monthly income. Sorry for the mental math explanation… :/
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 25.00 67.00 96.03 137.00 755.00 2025
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 9.0 15.0 22.1 31.0 131.0 24332
##
## Pearson's product-moment correlation
##
## data: EmploymentStatusDuration and OnTimeProsperPayments
## t = 4.1473, df = 5666, p-value = 3.414e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02902017 0.08093178
## sample estimates:
## cor
## 0.05501315
I thought this was a really cool relationship to explore: ‘EmploymentStatusDuration’ vs. ‘OnTimeProsperPayments’, filtered by ‘IncomeRange’ and ‘EmploymentStatus’.
After calculating Pearson’s coefficient, 0.0550, I still wanted to explore the relationship between the number of consecutive days a borrower has been employed and the number of consecutive Prosper loan payments the borrower has made, and lastly to filter it by income range or employment status. One would think there has to be some kind of relationship, right? It turns out, not so much.
Figure 1 just conveys the relationship between ‘EmploymentStatusDuration’ and ‘OnTimeProsperPayments’. Some adjustments to the variables were needed, but it’s clear to see how sporadic the value distribution is.
Figure 2 filters the visualization of figure 1 by ‘IncomeRange’. Although a bit ‘helter-skelter’, we clearly see the distinction in value patterns for each category of ‘IncomeRange’. What’s interesting here is that there’s a spike in ‘OnTimeProsperPayments’ right around 170 straight days of employment for borrowers within the income range of $1 - $24,999. This anomaly definitely warrants further investigation, considering none of the other income ranges surpass 50 consecutive ‘OnTimeProsperPayments’.
The same goes for figure 3, except this visualization is filtered by ‘EmploymentStatus’. Again, we can clearly see the distinct value patterns for each ‘EmploymentStatus’ category within the distribution. What I do find odd about this, however, and maybe this is just a perception ‘thing’, is this: shouldn’t ‘OnTimeProsperPayments’ increase as ‘EmploymentStatusDuration’ increases? That doesn’t really seem to be the trend in this data, and that might be worth further exploration… Maybe the borrowers are still ‘employed’ but have a different occupation, who knows.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.182 0.118 0.162 0.170 0.225 0.320 7664
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.081 0.112 0.366 7664
##
## Pearson's product-moment correlation
##
## data: EstimatedEffectiveYield and EstimatedLoss
## t = 190.15, df = 22334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7811803 0.7911963
## sample estimates:
## cor
## 0.7862399
Oh wow, this is interesting. So we have a fairly well correlated pair of quantitative variables in ‘EstimatedEffectiveYield’ and ‘EstimatedLoss’, with a correlation coefficient of 0.7862, which indicates a strong, though not perfectly linear, relationship.
This visualization conveys that near-linear relationship within the value distribution and, more insightfully, filters it by borrower state. Based on what I see here, the points that follow the most linear pattern belong to states marked by pink or purple points in the legend of the visualization. This group of values has an ‘EstimatedEffectiveYield’ range of .5 to .3.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.081 0.112 0.366 7664
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.182 0.075 0.092 0.096 0.117 0.267 7664
##
## Pearson's product-moment correlation
##
## data: EstimatedLoss and EstimatedReturn
## t = 104.01, df = 22334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5623272 0.5799985
## sample estimates:
## cor
## 0.571229
Ok, so you caught me! I really liked this visualization in the bivariate exploration section of this report, so I thought adding another variable into the equation would really make this plot stand out. And it did. It also doesn’t hurt that the correlation coefficient between ‘EstimatedLoss’ and ‘EstimatedReturn’ is 0.5712, suggesting some semblance of a relationship. Why not throw a categorical variable in the mix and see if we can truly uncover some hidden insights?
Here in figure 1 I compared two quantitative variables, ‘EstimatedReturn’ and ‘EstimatedLoss’, by plotting their relationship and filtering it by the exploratory factor variable ‘Occupation’. We can see a dense value distribution, or overplotting, between 0 and 0.15 ‘EstimatedLoss’.
Figure 2 does a much better job of conveying the vertical overplotting that occurs throughout the distribution of values. I took figure 1’s plot and added the scale_x_sqrt() function as a layer to scale the x-axis.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and EstimatedReturn
## t = -44.826, df = 22334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2992881 -0.2752238
## sample estimates:
## cor
## -0.2873013
Ahhh, the culmination of my exploration efforts, as I attempt to be ambitious here. I tried comparing quantitative variables with a negative correlation coefficient of -0.2873, using scatter plots to convey the relationship in a faceted grid.
In figure 1, ‘LoanOriginalAmount’ and ‘EstimatedReturn’ are suggested to have a negative relationship based on the statistical evidence. However, the facet-wrapped scatter plot of their value distribution makes it hard to see any distinction between the levels of the ‘ReasonForLoan’ variable. The only thing I can really see is the variance in ‘EstimatedReturn’ for some of the reasons for a loan.
Figure 2 attempts to convey the same visualization but instead uses a line plot faceted by ‘ReasonForLoan’. By also adding the geom_smooth() layer to this plot, we can see a better generalized trend within the data distribution. One thing appears consistent across all levels of ‘ReasonForLoan’ and agrees with the negative correlation coefficient: as the loan original amount increases, the estimated return percentage on that loan decreases. To further explore and understand the variance in estimated returns for each level of ‘ReasonForLoan’, more manipulation of the ‘x’ variable would be needed.
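Reading the direction of a trend off a busy faceted plot by eye is error-prone, so a per-level slope can serve as a cross-check. The sketch below uses synthetic data; the ‘ReasonForLoan’ labels are placeholders rather than the real categories, and the built-in downward slope is an assumption chosen to mirror the negative coefficient reported above.

```r
library(ggplot2)

set.seed(3)
# Synthetic data; the category labels are placeholders (assumption).
loans <- data.frame(
  LoanOriginalAmount = runif(600, 1000, 35000),
  ReasonForLoan = sample(c("Reason A", "Reason B", "Reason C"),
                         600, replace = TRUE)
)
loans$EstimatedReturn <- 0.11 - loans$LoanOriginalAmount * 1e-6 +
  rnorm(600, sd = 0.02)

# Scatter plot with a straight-line trend in each facet.
p <- ggplot(loans, aes(x = LoanOriginalAmount, y = EstimatedReturn)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ ReasonForLoan)

# Per-facet slopes, computed directly so the trend direction
# does not have to be judged by eye.
slopes <- sapply(split(loans, loans$ReasonForLoan), function(d) {
  unname(coef(lm(EstimatedReturn ~ LoanOriginalAmount, data = d))[2])
})
print(slopes)
```

A negative slope in every level would support a decline in return as the loan amount grows; mixed signs would mean the overall coefficient hides differences between loan reasons.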
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and StatedMonthlyIncome
## t = -9.4292, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06563661 -0.04307160
## sample estimates:
## cor
## -0.05436105
In my last exploration of multivariate plots, I chose to compare ‘BorrowerRate’ vs. ‘StatedMonthlyIncome’, filtered and faceted by both ‘CreditRange.bucket’ and ‘CreditGrade’.
Here in figure 1 we can see a value distribution of ‘BorrowerRate’ versus ‘StatedMonthlyIncome’, filtered by ‘CreditRange.bucket’. This visualization was then faceted by ‘CreditGrade’ to show the data distribution for each level. There seems to be quite a bit of overplotting in the unlabeled categorical level ’ ’ within ‘CreditGrade’. I think this is due to the credit grade rating only being available for listings pre-2009. So this ’ ’ category in ‘CreditGrade’, and the densely overplotted, unlabeled panel in figure 1, most likely represents borrowers whose ‘ListingCreationDate’ is after 2009, when a different rating was used.
In figure 2, I used the same comparison but switched the filter and facet-wrapping variables. Here, I chose to use ‘CreditGrade’ as my filter and facet wrapped the visualization using ‘CreditRange.bucket’. I also chose to use a line plot in hopes of conveying a better look at the relationship’s distribution. What seems interesting, and may warrant further exploration, is the fact that credit range buckets (650-700] & (850-900] showed the largest spikes in stated monthly incomes, hitting near $8000 and $10000 respectively.
your feature(s) of interest?
Based on my multivariate analysis of this data set, I would have to say there are a few relationships I explored where I thought the feature(s) of interest were strengthened by accompanying exploratory variables. For instance:
Borrower Rate vs. Stated Monthly Income, filtered by Is Borrower Homeowner
* For as simple as this plot is, I found it quite telling and insightful. The trend within the visualization is clear: borrowers who are homeowners seem to have decreasing borrower rates as their stated monthly income increases, whereas for borrowers who aren’t homeowners, borrower rates increase as stated monthly income increases. The Pearson coefficient of -0.0544 points to a very weak negative relationship overall, which appears to be true, but I can’t help but wonder about the visibly decreasing and increasing borrower rates. One possible explanation is that opposite trends in the two groups partly cancel out in the overall coefficient. And they say there isn’t much correlation?
I found the relationship between ‘OnTimeProsperPayments’ and ‘EmploymentStatusDuration’, filtered by ‘IncomeRange’, to have the most surprising effect on my exploration. I wasn’t expecting it at all, really. In fact, I was thinking the complete opposite. I had originally assumed that the longer a borrower’s employment status duration was, the more on-time Prosper payments they would have made. This wasn’t really the case here. What I found odd was that for borrowers with income ranges above $50,000 there seemed to be a peak of about 50 consecutive Prosper loan payments and nothing beyond that threshold. What makes the anomaly intriguing is that only the income range from $1 - $24,999 hit the mark of 50-plus consecutive on-time loan payments.
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and (as.numeric(LoanStatus))
## t = -2.9, df = 29998, p-value = 0.003735
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.028051860 -0.005426321
## sample estimates:
## cor
## -0.01674123
## # A tibble: 12 x 8
## LoanStatus borrower_rate_m… borrower_rate_m… borrower_rate_m…
## <fct> <dbl> <dbl> <dbl>
## 1 Cancelled 0.184 0.2 0.108
## 2 Chargedoff 0.235 0.24 0.01
## 3 Completed 0.186 0.174 0
## 4 Current 0.184 0.176 0.0577
## 5 Defaulted 0.223 0.230 0
## 6 FinalPaym… 0.197 0.190 0.0629
## 7 Past Due … 0.253 0.255 0.145
## 8 Past Due … 0.231 0.232 0.0749
## 9 Past Due … 0.235 0.242 0.0599
## 10 Past Due … 0.233 0.247 0.0649
## 11 Past Due … 0.240 0.247 0.0659
## 12 Past Due … 0.238 0.250 0.0766
## # ... with 4 more variables: borrower_rate_max <dbl>,
## # monthly_income_mean <dbl>, loan_amount_mean <dbl>, n <int>
## LoanStatus borrower_rate_mean borrower_rate_median
## Cancelled :1 Min. :0.1838 Min. :0.1744
## Chargedoff :1 1st Qu.:0.1943 1st Qu.:0.1975
## Completed :1 Median :0.2319 Median :0.2359
## Current :1 Mean :0.2200 Mean :0.2235
## Defaulted :1 3rd Qu.:0.2361 3rd Qu.:0.2468
## FinalPaymentInProgress:1 Max. :0.2527 Max. :0.2551
## (Other) :6
## borrower_rate_min borrower_rate_max monthly_income_mean loan_amount_mean
## Min. :0.00000 Min. :0.2375 Min. :2609 Min. : 1700
## 1st Qu.:0.04578 1st Qu.:0.3278 1st Qu.:4456 1st Qu.: 6465
## Median :0.06390 Median :0.3304 Median :5324 Median : 8080
## Mean :0.06043 Mean :0.3609 Mean :4966 Mean : 7388
## 3rd Qu.:0.07533 3rd Qu.:0.3701 3rd Qu.:5502 3rd Qu.: 8377
## Max. :0.14490 Max. :0.4975 Max. :6312 Max. :10361
##
## n
## Min. : 5
## 1st Qu.: 250
## Median : 338
## Mean : 9495
## 3rd Qu.: 6762
## Max. :56576
##
I chose this visualization as one of my final plots to represent my exploration because it’s quite telling. There are a few things here that, for me, immediately raised questions, some of which I had a chance to explore, and others not yet.
In figure 1, I was interested in finding out the distribution of borrower rates by loan status. I was hoping to find some relationship between the status of the loan and the borrower rate. What I found, however, was a count of borrower rates for each category of ‘LoanStatus’ in my visualization. It wasn’t giving me the answers I wanted. That’s what led me to create the ‘lnStats_by_LoanSts’ data frame with the summarized exploratory variables I wanted to explore, for instance the mean borrower rate for each group in ‘LoanStatus’. So, in figure 2, I used the ‘borrower_rate_mean’ variable I created and ta-da! This was what I was interested in. In this point plot, we can see the mean borrower rate for each category of ‘LoanStatus’. What’s interesting is that there’s a small grouping of ‘LoanStatus’ categories with mean borrower rates between 0.23 and 0.24, and nearly all of them are ‘Past Due…’ levels (‘Chargedoff’, at 0.235, is the exception). Another thing to note is the gap between 0.20 and 0.22: there’s nothing at all, not a single value, which just seems quite odd for a distribution. Maybe it warrants further investigation, but not today.
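A grouped summary like ‘lnStats_by_LoanSts’ could be built with dplyr along the lines below. This is a reconstruction from the column names visible in the printed output, not the original code, and the five-row sample data is made up.

```r
library(dplyr)

# Tiny made-up sample; the real data frame has roughly 114,000 rows.
loans <- data.frame(
  LoanStatus          = c("Completed", "Completed", "Current", "Current", "Chargedoff"),
  BorrowerRate        = c(0.10, 0.20, 0.15, 0.25, 0.30),
  StatedMonthlyIncome = c(4000, 6000, 5000, 7000, 3000),
  LoanOriginalAmount  = c(5000, 7000, 10000, 12000, 4000)
)

# One row per LoanStatus, with the summary columns seen in the output.
lnStats_by_LoanSts <- loans %>%
  group_by(LoanStatus) %>%
  summarise(
    borrower_rate_mean   = mean(BorrowerRate),
    borrower_rate_median = median(BorrowerRate),
    borrower_rate_min    = min(BorrowerRate),
    borrower_rate_max    = max(BorrowerRate),
    monthly_income_mean  = mean(StatedMonthlyIncome),
    loan_amount_mean     = mean(LoanOriginalAmount),
    n                    = n()
  ) %>%
  arrange(LoanStatus)

print(lnStats_by_LoanSts)
```

Keeping the group size `n` next to each mean is useful here, because some ‘LoanStatus’ levels in the real output contain only a handful of loans and their means are correspondingly noisy.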
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001 0.1349 0.1845 0.1937 0.2511 0.4975
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 164 1894 8564 8082 4476
## $100,000+ Not displayed Not employed
## 4547 2056 217
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and (as.numeric(IncomeRange))
## t = -27.361, df = 29998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1670623 -0.1449814
## sample estimates:
## cor
## -0.1560413
My initial game plan was to explore what I thought might have a potential influence on borrower rates. This plot, though relatively simple in nature, really helped my analysis in two areas. First, it confirmed some of the preconceptions that I had about the data. For example, I had assumed that richer borrowers would have lower interest rates, or borrower rates (for the purpose of this project, :) ), due to affluence. And second, it helped me truly organize my analysis and decide which exploratory variables I would analyze. I understand the purpose of the ggpairs() function and what it does. But sometimes scatter plot matrices have a difficult time conveying relationships between quantitative and qualitative variables. At least for me, I have a hard time discerning their relationships. Using categorical/factor variables really adds a strong dimension to visualizations. It allows you to filter your data quite selectively. Enough of why I chose this plot. “To the explanation!”
In figure 1 the visualization conveys the borrower rate counts for each level in ‘IncomeRange’. What’s great about using the facet_wrap() function as an additional layer on my plot is that you can compare each level’s distribution to itself or to other levels. It really dissects your data in truly insightful ways. Because of this, we can clearly see that income ranges between $25,000 and $74,999 have the largest borrower counts. What’s also obvious, and somewhat expected, is the low borrower count for the ‘IncomeRange’ category ‘Not employed’. Who wants to lend money to someone without a job? I get it. But I’m glad, based on this data, that someone took a chance on this group of borrowers.
Figure 2 can easily be one of my top 5 plots in this project. For one, it helped pat my ego a bit by confirming one of my preconceptions mentioned earlier, but two, and most importantly of course, is how easy it is to compare the data for each income range to another; either min, max or mean, it doesn’t matter. Here we clearly see a downward trend in mean borrower rate as income range increases. It also looks as though the ‘IncomeRange’ category ‘Not employed’ has the highest mean borrower rate of the group at just over 0.25. What I’d really like to explore are the credit ranges for some of these income range groups, especially those with lower mean borrower rates.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 25.00 67.00 96.03 137.00 755.00 2025
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 9.0 15.0 22.1 31.0 131.0 24332
##
## Pearson's product-moment correlation
##
## data: EmploymentStatusDuration and OnTimeProsperPayments
## t = 4.1473, df = 5666, p-value = 3.414e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02902017 0.08093178
## sample estimates:
## cor
## 0.05501315
## List of 1
## $ legend.position: chr "top"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
This was a great exploration of quantitative variables, using categorical ones to filter the distribution. I was curious to find out whether a borrower’s tenure at work affected whether or not they made on-time loan payments. And more interestingly, are on-time loan payments affected by employment status? This journey helped answer some of those questions.
In figure 1 my initial gander at the two variables showed an interesting trend, which was also one of my preconceptions about the data: as employment status duration increases, so should the number of on-time payments made, right? It seems so. I mean, there is some variance, but ‘OnTimeProsperPayments’ seems to trend upward as ‘EmploymentStatusDuration’ increases. At first I was a bit confused as to why the count of on-time payments wasn’t much more than 40 for the first 200 or so days. Then it dawned on me… a year is 365 days, which isn’t long when it comes to employment duration, so the low fluctuation in ‘OnTimeProsperPayments’ before 200 days makes sense. And as an employee’s tenure at work extends, so should the number of on-time payments, which is represented in the data by sudden upticks or spikes around 250 days and beyond.
Figure 2 is just a filtered plot of figure 1. I chose to use the ‘IncomeRange’ variable to see if it had any influence on ‘OnTimeProsperPayments’. What I found was a bit unexpected: for the largest income range, $100,000+, there seemed to be a downward trend in the number of on-time payments as employment status duration increased. I also noticed that only one category in ‘IncomeRange’ had a count of ‘OnTimeProsperPayments’ over 50. This partially led me to believe that value was just an anomaly within the distribution. Further exploration is definitely needed.
Lastly, figure 3 is again a plot of the relationship between ‘EmploymentStatusDuration’ and ‘OnTimeProsperPayments’, filtered by ‘EmploymentStatus’. Here we can see there is quite a bit of variance between employment status levels and their counts of ‘OnTimeProsperPayments’ made. What stood out the most to me here was the category of ‘Self-employed’. It showed a gradual upward trend in the number of on-time payments made. The ‘Employed’ status seemed to have the least variance in its distribution as well.
The loan data set comprised about 114,000 observations with 80+ variables to explore. I was originally interested in borrower rate, and what variables, if any, might affect its value. What I found was more questions than I started with.
It was good to see that some of my preconceptions weren’t too far off; however, some discoveries were straight out of left field for me. For example, when comparing ‘StatedMonthlyIncome’ vs. ‘BorrowerRate’ and filtering on ‘IsBorrowerHomeowner’, I noticed that borrower rates dropped drastically for homeowners who stated a monthly income of $7,000 or more. What could that be due to? Or how about the visualization conveying average borrower rates by state? I thought for sure I’d find something interesting there, but the min, max and mean borrower rates didn’t fluctuate too much.
The explorations that really rang loud for me were my final plot selections. I realized that to truly find insight and patterns in data, you have to leverage the categorical variables within your data set. Line charts and scatter plots are useful, but when you segment your data in a way that groups your exploration, it’s incredible what you can find or explore.
One limitation of my analysis is that I didn’t explore the data as a time series, for example by analyzing selected borrowers’ rates over time and the variables that may or may not have changed during that time, to determine what influences the borrowing rate, or interest rate. I could have also manipulated the data frame a bit more. It’s currently in ‘wide’ format, and my comfort level with the ‘dplyr’ & ‘tidyr’ libraries is mediocre at best. I’m sure if I could manipulate this data more efficiently, there’s a world of insight to uncover here. As a newbie to R (hint: another limitation), I found it a bit harder to debug because I just wasn’t familiar with the syntax or style of the programming language. I kept subconsciously injecting Pythonic code into my attempts at analyzing data. It would take a sec before I realized the error of my ways… lol. Don’t judge me. Lastly, my biggest frustration was formatting my project in R Markdown. I’m not too familiar with the syntax, so this was difficult, especially creating multi-line headers and tables within my report. This was fun though. Keep ’em coming. :)